APDL Eductation Resources

Text File | 1995-03-07 | 7.8 KB | 245 lines

This document contains some technical information about SEEK. Programmers
who want to write their own programs to perform operations on the text will
find pertinent information here.

Memory Problems
===============

In order that this program be usable on 1Mb machines, I have set the
WimpSlot parameter quite tight. I'm a little worried that some
configurations might occasionally run out of memory. If you do encounter
memory-related errors like "Too many nested procedures" or "no room for
procedure call", then edit the !Run file and increase the WimpSlot
parameter.

The compression algorithm.
==========================

Firstly there is a file (WORDSORT) which contains all the words used in the
entire set of text files, sorted into alphabetical order. It also contains
the number of occurrences of each word. You can read WORDSORT like:-

      file%=OPENIN("WORDSORT")
      REPEAT
        INPUT#file%,word$,count%
        REM . . . use word$ and its occurrence count count% here . . .
      UNTIL EOF#file%
      CLOSE#file%

The compressed file contains a number of 16-bit tokens, which can be one of
five types:-

   Punctuation mark (after word):
      bits 0-1:   Word Type = 3 (special)
      bits 2-7:   Marker = 0
      bits 8-15:  Punctuation Character in ASCII

   Punctuation mark (before word):
      bits 0-1:   Word Type = 3 (special)
      bits 2-7:   Marker = 1
      bits 8-15:  Punctuation Character in ASCII

   Verse mark:
      bits 0-1:   Word Type = 3 (special)
      bits 2-7:   Marker = 2
      bits 8-15:  Verse Number

   Chapter mark:
      bits 0-1:   Word Type = 3 (special)
      bits 2-7:   Marker = 3
      bits 8-15:  Chapter Number

   Word Token:
      bits 0-1:   Word Type
                     0 = lower case
                     1 = Initial Capital
                     2 = ALL UPPER CASE
      bits 2-15:  Word Number

Whenever a new chapter starts, there will be a chapter token. Whenever a
new verse starts, there will be a verse token.

Any characters other than alphabetic characters and apostrophes are
represented by punctuation tokens. There are two classes of punctuation
tokens: punctuation that follows a word, e.g.
      THIS,   THAT!   THESE.
and punctuation that precedes a word, e.g.
      "THIS   (THAT

The actual words of the text are represented by a 14-bit word number. A
14-bit field can take 16,384 distinct values (0 to 16,383), which limits
the size of the word list. E.g. word #1 is "a", word #2 is "aaron", word #3
is "aaron's". If the word appears with a leading capital letter, then the
word type is set to 1. If all the letters in the word are upper case, then
the word type is set to 2. Word type 3 is for non-word tokens.

The words are always separated by spaces. There is no token for space,
since the program knows that there is always a space after a word. The
position of the space is after all the after-word punctuation (marker 0)
and before any before-word punctuation (marker 1). This means that when
decoding the text, you need to look at the next token before you can decide
if a space is required at the current position.

Let's look at an example:

      They said unto him, Rabbi, (which is to say, being interpreted,
      Master,) where dwellest thou?

This becomes

      WORD CAPITALISED       They
      WORD                   said
      WORD                   unto
      WORD                   him
      PUNCTUATION AFTER      ,
      WORD CAPITALISED       Rabbi
      PUNCTUATION AFTER      ,
      PUNCTUATION BEFORE     (
      WORD                   which
      WORD                   is
      WORD                   to
      WORD                   say
      PUNCTUATION AFTER      ,
      WORD                   being
      WORD                   interpreted
      PUNCTUATION AFTER      ,
      WORD CAPITALISED       Master
      PUNCTUATION AFTER      ,
      PUNCTUATION AFTER      )
      WORD                   where
      WORD                   dwellest
      WORD                   thou
      PUNCTUATION AFTER      ?
      VERSE MARK

So the text is compressed from the 94 bytes of plain text (plus a bit for
the chapter & verse numbers) to 24 16-bit tokens, i.e. 48 bytes.

The main objective of this compression is to improve word search speed.
This is achieved as follows:-

Suppose we want to find all verses containing both "Rabbi" and
"interpreted". We proceed as follows:-

   Look up "rabbi" and "interpreted" in the word list. The word "rabbi"
   occurs 8 times, and "interpreted" occurs 11 times.

   Choose the least frequent word to be the primary search parameter, in
   this case "rabbi". The word number of "rabbi" is 1409.

   Load each file in turn into memory, and scan it. Look at each 16-bit
   token. If the word type is not 3, and the word number is 1409, then we
   have a match.
   It could be "rabbi", "Rabbi" or "RABBI" depending on the word type.
   Keep track of the last chapter and verse tokens during this search.

   When we find a verse containing "rabbi", jump back to the start of the
   verse, and scan the verse for word number 921 ("interpreted").

   Only decompress the text once all the search keys have been satisfied.

Suppose we want to find all verses containing both "archimedes" and
"computer".

   Look up "archimedes" and "computer" in the word list. "archimedes" is
   not in the word list at all. Therefore, the word "archimedes" can't be
   in the text, so there's no point looking for it.

   Reply immediately: SEARCH COMPLETE - NO OCCURRENCES FOUND.

Timing Considerations
=====================

A 3-word search takes almost exactly the same time as a 1-word search that
produces the same number of hits. The more hits that are found, the longer
it takes, because we decompress the verse when we have a hit, and update
the progress window. The primary search algorithm is written in ARM code,
but everything else is written in BASIC.

In Mode-27, on an A5000, using an IDE hard disk, a scan of the whole Bible
takes 21 seconds. Each verse found adds 0.07 seconds.

Memory Considerations
=====================

I want it to be usable on a 1 Mb machine. In order to achieve this, I'm
limiting the maximum number of output lines to 1000. Each verse can occupy
several lines. 1000 output lines might be about 400 verses.

Configuration File
==================

With the data files, there is a file called "Config". This contains
information about the special effects and the files to be searched.

Special Effects
===============

There must always be three special effects. Each effect is controlled by
three lines of information. These lines contain:-

      The effect name - this will appear in the FORMAT window
      The string which switches the effect ON
      The string which switches the effect OFF

For example:-

      Impress Super
      {script super}
      {script}

This defines an effect called "Impress Super".
When selected, verses will be saved like

      {script super}Jn 3:16{script} God so loved. . .

If you load this into Impression, "Jn 3:16" will appear in superscript.

Files to be searched
====================

The book files are defined as follows:-

The order of the entries controls the order in which the files will be
searched. So normally the entries will be in the order that the books occur
in the Bible.

Each entry contains:

      The abbreviation for the book (used for references)
      The filename
      The type of the book
      The full name of the book

separated by commas.

Book types are used to control the range of the search. The types
correspond to the icons in the selection window.

      Type 1:   Pentateuch        Genesis-Deuteronomy
      Type 2:   History           Joshua-Esther
      Type 3:   Poetry            Job-Song of Solomon
      Type 4:   Major Prophets    Isaiah-Daniel
      Type 5:   Minor Prophets    Hosea-Malachi
      Type 6:   Gospels           Matthew-John
      Type 7:   Acts              Acts
      Type 8:   Letters           Romans-Jude
      Type 9:   Revelation        Revelation

The program is very sensitive to errors in the config file, so save a copy
first, and be careful with those commas.
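
For programmers who want to decode the text themselves, the token layout and the spacing rule described under "The compression algorithm" can be sketched in Python roughly as follows. This is only an illustration and not SEEK's own code. All function names and the small word table are made up for the example, and only the word numbers 1409 ("rabbi") and 921 ("interpreted") come from the text above. The sketch works on token values that have already been read as integers, so it makes no assumption about the byte order used in the data files.

```python
# Illustrative sketch only: names and most word numbers are hypothetical.
SPECIAL = 3                                              # word type of non-word tokens
PUNCT_AFTER, PUNCT_BEFORE, VERSE, CHAPTER = 0, 1, 2, 3   # marker values

def decode_token(token):
    """Split a 16-bit token into (word_type, marker, value)."""
    word_type = token & 0x3                              # bits 0-1
    if word_type == SPECIAL:
        # bits 2-7 hold the marker, bits 8-15 the character or number
        return word_type, (token >> 2) & 0x3F, (token >> 8) & 0xFF
    return word_type, None, (token >> 2) & 0x3FFF        # bits 2-15: word number

def word_token(number, word_type=0):
    return (number << 2) | word_type

def special_token(marker, value):
    return (value << 8) | (marker << 2) | SPECIAL

def apply_case(word, word_type):
    if word_type == 1:
        return word.capitalize()                         # Initial Capital
    if word_type == 2:
        return word.upper()                              # ALL UPPER CASE
    return word

def render(tokens, words):
    """Rebuild the text of one verse, stopping at the first verse/chapter mark."""
    out = []
    pending_space = False          # a word was emitted and its space is still owed
    for token in tokens:
        word_type, marker, value = decode_token(token)
        if word_type != SPECIAL:
            if pending_space:
                out.append(" ")
            out.append(apply_case(words[value], word_type))
            pending_space = True
        elif marker == PUNCT_AFTER:
            out.append(chr(value))                       # goes before the owed space
        elif marker == PUNCT_BEFORE:
            if pending_space:
                out.append(" ")                          # the space goes first
                pending_space = False
            out.append(chr(value))
        else:
            break                                        # verse or chapter mark
    return "".join(out)

# Word table: 1409 and 921 are from the text; 9001 onwards are placeholders.
PLACEHOLDERS = ["they", "said", "unto", "him", "which", "is", "to", "say",
                "being", "master", "where", "dwellest", "thou"]
WORDS = {1409: "rabbi", 921: "interpreted"}
WORDS.update({9001 + i: word for i, word in enumerate(PLACEHOLDERS)})
NUMBER = {word: number for number, word in WORDS.items()}

def w(word, word_type=0):
    return word_token(NUMBER[word], word_type)

def after(ch):
    return special_token(PUNCT_AFTER, ord(ch))

def before(ch):
    return special_token(PUNCT_BEFORE, ord(ch))

verse = [
    w("they", 1), w("said"), w("unto"), w("him"), after(","),
    w("rabbi", 1), after(","), before("("), w("which"), w("is"), w("to"),
    w("say"), after(","), w("being"), w("interpreted"), after(","),
    w("master", 1), after(","), after(")"), w("where"), w("dwellest"),
    w("thou"), after("?"), special_token(VERSE, 2),   # placeholder verse number
]

print(render(verse, WORDS))
print(len(verse), "tokens =", len(verse) * 2, "bytes")
```

The token list mirrors the worked example, so the output should be the original sentence again, followed by the 24-token, 48-byte count. A search only needs `decode_token`: compare the word number of each non-special token with the number looked up in WORDSORT, and call something like `render` only for verses that match.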